Fix various multicluster issues #1480
Merged: andrewstucki merged 4 commits into main, Apr 24, 2026
**andrewstucki** (author): See the NOTE ("this also exists in the current Redpanda controller and will need to be partially backported") -- I'll open up individual backport PRs for this fix to our current target branches.
hidalgopl approved these changes, Apr 24, 2026
RafalKorepta approved these changes, Apr 24, 2026
A batch of fixes surfaced during end-to-end stretch-cluster partition testing (a 3-cluster demo across EKS/AKS/GKE over Tailscale + Cilium ClusterMesh), plus tooling improvements to `rpk k8s multicluster` that remove the chicken-and-egg bootstrap cycle when leveraging publicly accessible load balancers and fix the TLS-SAN health check.

## Bug fixes

### Hot reconcile loop on healthy stretch clusters
Two root causes produced a steady stream of `.status` writes at the condition-heartbeat cadence:

- `features.Features` produced a non-deterministic `InUseFeatures` ordering on each call, flipping `LicenseValid`'s stored value on every reconcile. Fixed by `sort.Strings(inUseFeatures)` in both `operator/internal/controller/redpanda/multicluster_controller.go` and `operator/internal/controller/redpanda/redpanda_controller.go`.
- Updated `operator/statuses.yaml` and regenerated `operator/internal/statuses/zz_generated_status.go`. The forced-dirty write remains so `lastUpdateTime` on the condition stays meaningful.

NOTE: this also exists in the current Redpanda controller and will need to be partially backported.
### `PodEndpoints` dropped during cross-cluster owner rebuild

In flat-network mode, `StretchClusterOwnershipResolver.ResolveOwnerReference` rebuilds the owner wrapper when the owner UID comes from a different k8s cluster than the one being reconciled. The rebuild used `NewStretchClusterWithPools`, which carries forward `NodePools` but not `PodEndpoints`. Result: each `SyncAll` iteration that targeted a peer cluster would render flat-mode per-pod Endpoints with an empty IP list, and the Syncer would GC existing cross-cluster `Endpoints`/`EndpointSlices` every reconcile cycle.

Fixed by preserving `PodEndpoints` on the new owner in `operator/internal/lifecycle/stretch_cluster_ownership.go`.
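A stripped-down illustration of the rebuild bug and the fix. The types here are hypothetical simplifications of the real lifecycle structs, which carry far more state:

```go
package main

import "fmt"

// StretchCluster is a simplified stand-in for the real owner wrapper.
type StretchCluster struct {
	NodePools    []string
	PodEndpoints map[string][]string // cluster name -> pod IPs
}

// rebuildOwner mimics the cross-cluster rebuild path. Constructing a fresh
// wrapper from pools alone (the old behavior) silently drops PodEndpoints,
// so flat-mode per-pod Endpoints render with an empty IP list.
func rebuildOwner(old *StretchCluster) *StretchCluster {
	fresh := &StretchCluster{NodePools: old.NodePools}
	// The fix: carry the already-resolved endpoints onto the new owner so
	// the Syncer doesn't GC cross-cluster Endpoints/EndpointSlices.
	fresh.PodEndpoints = old.PodEndpoints
	return fresh
}

func main() {
	owner := &StretchCluster{
		NodePools:    []string{"default"},
		PodEndpoints: map[string][]string{"two": {"10.0.0.5"}},
	}
	rebuilt := rebuildOwner(owner)
	fmt.Println(len(rebuilt.PodEndpoints["two"])) // → 1 (non-empty: nothing gets GC'd)
}
```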
## Observability

Trace logging added along the flat-mode per-pod Endpoints path so the next time this regresses we can see it in the logs rather than correlating timestamps with `kubectl get endpoints`:

- `operator/internal/lifecycle/client.go`: `FetchExistingAndDesiredPools` and `fetchExistingPools`.
- `operator/internal/lifecycle/pool.go`: `PodEndpoints(ctx)` — added a `ctx` parameter; logs per-cluster pod/ready/no-IP counts.
- `operator/multicluster/endpoints.go`: info-level log when rendering nil on empty `podEndpoints` (the smoking-gun signal for the GC-loop bug), plus trace logs for rendered counts and per-service skips.
- `operator/multicluster/render_state.go`: `RenderState.ctx` field plus `WithContext`/`Context` methods to plumb the reconcile-scoped logger down to render helpers.
- `operator/internal/lifecycle/stretch_cluster_simple_resources.go`: threads the reconcile `ctx` into the render state.

## `rpk k8s multicluster status` — TLS SAN check correctness

`operator/cmd/rpk-k8s/k8s/multicluster/checks/cluster_tls_san.go` rewritten. The previous check substring-matched the cluster's logical name against `cert.DNSNames` (e.g., `"two"` against DNS SANs), and `cert.IPAddresses` wasn't consulted.

Now resolves the expected peer address from the Deployment's `--peer=<self>://<addr>:9443` flag and calls `x509.Certificate.VerifyHostname(addr)` — the same routine Go's TLS stack uses during a real peer dial. Handles DNS wildcards, IP literals, and case normalization uniformly. Error messages include both `DNSNames` and `IPAddresses` so the cause is obvious without pulling the cert.
## `rpk k8s multicluster bootstrap --loadbalancer`

New flag resolves the deploy/redeploy chicken-and-egg: previously, to get peer addresses into each cluster's cert SANs and `--peer` flags when leveraging an external load balancer, you'd install the operator, wait for the chart's `LoadBalancer` Service to provision, read the addresses, and redeploy with those values baked into helm values.
bootstrap --loadbalancerprovisions a standalone peer-onlyLoadBalancerService per cluster (name<fullname>-multicluster-peer, distinct from the chart's Service, labelledoperator.redpanda.com/bootstrap-managed=truefor discovery), waits for the provider to publish an address, and signs each cert with that address in the SANs. On completion it prints a ready-to-pastemulticluster.peersblock in helm-values shape.New files:
pkg/multicluster/bootstrap/loadbalancer.go:EnsurePeerLoadBalancer— idempotent, re-runs reuse the existing Service and re-read the address so repeated invocations don't thrash cloud LB resources.operator/cmd/rpk-k8s/k8s/multicluster/spinner.go: minimal TTY-aware spinner so the provisioning wait (often several minutes) has visible progress. Falls back to plain lines when stdout isn't a TTY.Precedence when both
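The idempotent get-or-create shape can be sketched against a toy in-memory client. The real `EnsurePeerLoadBalancer` talks to the Kubernetes API and waits for the provider to publish an ingress address; everything below is a simplified illustration of the reuse property only:

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a hypothetical stand-in for corev1.Service, keeping only the
// fields this sketch needs.
type Service struct {
	Name    string
	Labels  map[string]string
	Ingress []string // filled in by the cloud provider once the LB exists
}

type client struct{ services map[string]*Service }

var errNotFound = errors.New("not found")

func (c *client) Get(name string) (*Service, error) {
	if s, ok := c.services[name]; ok {
		return s, nil
	}
	return nil, errNotFound
}

func (c *client) Create(s *Service) { c.services[s.Name] = s }

// ensurePeerLoadBalancer sketches the idempotent get-or-create: re-runs
// reuse the existing Service (and its already-published address) rather
// than provisioning a fresh cloud load balancer each time.
func ensurePeerLoadBalancer(c *client, fullname string) *Service {
	name := fullname + "-multicluster-peer"
	if existing, err := c.Get(name); err == nil {
		return existing // reuse: no cloud LB churn on repeated invocations
	}
	svc := &Service{
		Name:   name,
		Labels: map[string]string{"operator.redpanda.com/bootstrap-managed": "true"},
	}
	c.Create(svc)
	return svc
}

func main() {
	c := &client{services: map[string]*Service{}}
	first := ensurePeerLoadBalancer(c, "redpanda-operator")
	first.Ingress = []string{"203.0.113.7"} // provider publishes an address
	second := ensurePeerLoadBalancer(c, "redpanda-operator")
	fmt.Println(second.Ingress[0]) // → 203.0.113.7 (same Service, address preserved)
}
```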
Precedence when both `--loadbalancer` and `--dns-override` are specified: `--dns-override` wins for the cert SAN, and LB provisioning is skipped for that cluster. If someone needs "always provision, but cert SAN from override", that's a follow-up that splits `ServiceAddress` into advertise-as vs. cert-SAN fields.
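The precedence rule reduces to a small pure function. A hedged sketch, where the function name and signature are hypothetical rather than the real bootstrap internals:

```go
package main

import "fmt"

// resolvePeerCertSAN illustrates the documented precedence: an explicit
// --dns-override wins for the cert SAN and suppresses LB provisioning for
// that cluster; otherwise the LB-published address is used and an LB is
// provisioned. (Illustrative helper, not the actual code.)
func resolvePeerCertSAN(dnsOverride, lbAddress string) (san string, provisionLB bool) {
	if dnsOverride != "" {
		return dnsOverride, false
	}
	return lbAddress, true
}

func main() {
	san, lb := resolvePeerCertSAN("peer-two.example.com", "")
	fmt.Println(san, lb) // → peer-two.example.com false
	san, lb = resolvePeerCertSAN("", "203.0.113.7")
	fmt.Println(san, lb) // → 203.0.113.7 true
}
```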
## Chart: `multicluster.service` addition

We can now render a configurable Service alongside each operator deployment, so that service meshes and MCS implementations can quickly discover the operator. The Service/MCS primitives are disabled by default. If you want the operators to communicate over a public network, the bootstrap `--loadbalancer` flag should likely be used instead, though you can also combine provisioning a `LoadBalancer` Service here with something like external-dns annotations to give it a pre-determined, well-known hostname.